Heat Map for Customer Transaction Data#

In this example, a heat map is used to visualise the transaction data of an iOS app.

The purpose is to display each customer's behaviour history (e.g. when their transactions occurred), given n transaction opportunities per customer, and to show what type of transaction took place.

The transaction types are: starting a free trial after installing, making a first payment after a free trial, and continuing to pay as a subscriber.

Raw Data Review#

The case study data comes from two files:

  1. installs.csv: Contains the install date and segment assignment for every user included in this experiment.

    • user_id: A numeric user identifier.

    • segment: A number from 0 to 2 identifying the segment the user was assigned to when they first opened the app.

    • install_date: The date the user opened the app for the first time, in the format YYYY-MM-DD.

  2. transactions.csv: Contains information about purchases made by users in the app, including free trial purchases.

    • transaction_date: The date a transaction was completed, in the format YYYY-MM-DD.

    • user_id: A numeric user identifier.

    • is_free_trial: True if the transaction is a free trial.

    • is_conversion_to_pay: True if the transaction is the first paid transaction after a free trial.

    • price_in_usd: The price the user paid for the transaction. It is null if the transaction was a free trial.
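A minimal toy example of the two schemas, with hypothetical values (not the real case-study data), in case the CSV files are not to hand:

```python
import pandas as pd

# Hypothetical rows matching the installs.csv schema
installs = pd.DataFrame({
    'user_id': [5, 10, 22],
    'segment': [1, 0, 1],
    'install_date': ['2023-09-09', '2023-09-19', '2023-09-04'],
})

# Hypothetical rows matching the transactions.csv schema;
# price_in_usd is None (null) for free-trial transactions
transactions = pd.DataFrame({
    'transaction_date': ['2023-09-09', '2023-09-19', '2023-09-26'],
    'user_id': [5, 10, 10],
    'is_free_trial': [True, True, False],
    'is_conversion_to_pay': [False, False, True],
    'price_in_usd': [None, None, 2.99],
})
```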

# Import packages
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import datetime as dt
import numpy as np
# Load data
installs = pd.read_csv('/Users/maxandchang/Documents/JupyterBook/WikiScratch/dataCollection/installs.csv')
transactions = pd.read_csv('/Users/maxandchang/Documents/JupyterBook/WikiScratch/dataCollection/transactions.csv')

# Convert transaction_date and install_date to datetime
installs['install_date'] = pd.to_datetime(installs['install_date'])
transactions['transaction_date'] = pd.to_datetime(transactions['transaction_date'])

# Combine 'installs' and 'transactions' on 'user_id'; the inner join discards users with no transactions after installing
df = pd.merge(installs, transactions, on='user_id', how='inner')


# Replace NaN with 0 in the price_in_usd column (free trials have no price)
df['price_in_usd'] = df['price_in_usd'].fillna(0)

# show dataframe head
print(df.head())
   user_id  segment install_date transaction_date  is_free_trial  \
0        5        1   2023-09-09       2023-09-09           True   
1       10        0   2023-09-19       2023-09-19           True   
2       10        0   2023-09-19       2023-09-26          False   
3       22        1   2023-09-04       2023-09-04           True   
4       22        1   2023-09-04       2023-09-11          False   

   is_conversion_to_pay  price_in_usd  
0                 False          0.00  
1                 False          0.00  
2                  True          2.99  
3                 False          0.00  
4                  True          2.99  
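A minimal sketch (with hypothetical toy values) of why `how='inner'` discards install-only users:

```python
import pandas as pd

# Toy data: user 99 installed but never made a transaction
installs = pd.DataFrame({'user_id': [5, 99],
                         'install_date': ['2023-09-09', '2023-09-10']})
transactions = pd.DataFrame({'user_id': [5],
                             'transaction_date': ['2023-09-09']})

# how='inner' keeps only user_ids present in BOTH frames,
# so users with no transactions after installing are dropped
merged = pd.merge(installs, transactions, on='user_id', how='inner')
print(merged['user_id'].tolist())  # [5]
```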

Find the start and end dates

# Finding the earliest and latest dates
earliest_install_date = df['install_date'].min()
latest_install_date = df['install_date'].max()
earliest_transaction_date = df['transaction_date'].min()
latest_transaction_date = df['transaction_date'].max()

print("Earliest install date:", earliest_install_date)
print("Latest install date:", latest_install_date)
print("Earliest transaction date:", earliest_transaction_date)
print("Latest transaction date:", latest_transaction_date)
Earliest install date: 2023-09-01 00:00:00
Latest install date: 2023-09-29 00:00:00
Earliest transaction date: 2023-09-01 00:00:00
Latest transaction date: 2024-03-25 00:00:00

The information we gathered from the raw data:

  • Customers are divided into three segments, likely for a future A/B test.

  • Regardless of segment, each row of transaction data represents one customer behaviour. These behaviours are categorised as follows:

    1. TF: is_free_trial is True, is_conversion_to_pay is False -> Customers start their free trials.

    2. FT: is_free_trial is False, is_conversion_to_pay is True -> Customers make their first payment after free trials.

    3. FF: is_free_trial is False, is_conversion_to_pay is False -> Customers who previously paid continue to pay.

    4. The TT scenario does not exist.

  • Earliest install date: 2023-09-01 00:00:00 <-> Latest install date: 2023-09-29 00:00:00

  • Earliest transaction date: 2023-09-01 00:00:00 <-> Latest transaction date: 2024-03-25 00:00:00

  • From 2023-09-01 00:00:00 to 2024-03-25 00:00:00, there are 30 weeks of transaction opportunities. Therefore, we can convert ‘date’ time to ‘week’ time.

  • Set 2023-09-01 00:00:00 as the start date and first week.
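The date-to-week mapping described in these bullets can be sketched as follows; `week_number` is a hypothetical helper mirroring the formula applied later in the notebook:

```python
import datetime as dt

start_date = dt.datetime(2023, 9, 1)

def week_number(date):
    """Week index, with 2023-09-01 falling in week 1."""
    return (date - start_date).days // 7 + 1

print(week_number(dt.datetime(2023, 9, 1)))   # 1  (day 0   -> week 1)
print(week_number(dt.datetime(2023, 9, 9)))   # 2  (day 8   -> week 2)
print(week_number(dt.datetime(2024, 3, 25)))  # 30 (day 206 -> week 30)
```

This confirms the 30 weeks of transaction opportunities between the earliest install date and the latest transaction date.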

Reshape the DataFrame to be Suitable for Heatmaps#

Add a ‘status’ Column to Denote TF, FT, FF#

We also append the transaction price as a suffix to indicate the amount earned from each transaction.

# Create a new column 'status' by combining the first letters from 'is_free_trial' and 'is_conversion_to_pay'.
# We also append the price as a suffix to indicate the amount earned from each transaction.
df2 = df.copy()
df2['status'] = df2['is_free_trial'].astype(str).str[0] + \
                df2['is_conversion_to_pay'].astype(str).str[0] + \
                df2['price_in_usd'].astype(str)
print(df2.head())
   user_id  segment install_date transaction_date  is_free_trial  \
0        5        1   2023-09-09       2023-09-09           True   
1       10        0   2023-09-19       2023-09-19           True   
2       10        0   2023-09-19       2023-09-26          False   
3       22        1   2023-09-04       2023-09-04           True   
4       22        1   2023-09-04       2023-09-11          False   

   is_conversion_to_pay  price_in_usd  status  
0                 False          0.00   TF0.0  
1                 False          0.00   TF0.0  
2                  True          2.99  FT2.99  
3                 False          0.00   TF0.0  
4                  True          2.99  FT2.99  

Add a ‘week’ Column to Represent the Transaction Time#

Convert dates to week numbers. Set ‘2023-09-01’ as the start of the first week and ‘2024-03-25’ as the end of the last week.

# Define start and end dates
start_date = dt.datetime(2023, 9, 1)
end_date = dt.datetime(2024, 3, 25)

# Calculate the 'week' column
df2['week'] = ((df2['transaction_date'] - start_date).dt.days // 7) + 1
# Calculate the number of transaction opportunities
df2['n_trans_opp'] = ((end_date - df2['install_date']).dt.days // 7) + 1

print(df2.head())
   user_id  segment install_date transaction_date  is_free_trial  \
0        5        1   2023-09-09       2023-09-09           True   
1       10        0   2023-09-19       2023-09-19           True   
2       10        0   2023-09-19       2023-09-26          False   
3       22        1   2023-09-04       2023-09-04           True   
4       22        1   2023-09-04       2023-09-11          False   

   is_conversion_to_pay  price_in_usd  status  week  n_trans_opp  
0                 False          0.00   TF0.0     2           29  
1                 False          0.00   TF0.0     3           27  
2                  True          2.99  FT2.99     4           27  
3                 False          0.00   TF0.0     1           30  
4                  True          2.99  FT2.99     2           30  

Add an Auxiliary Column ‘status2’#

Since ‘TF0.0’, ‘FT2.99’, and ‘FF2.99’ cannot be recognised as colour categories in heatmaps, we add a numeric column ‘status2’. One can choose any three numbers to replace ‘TF0.0’, ‘FT2.99’, and ‘FF2.99’. For example, we may use 1, 2, and 3.

# Define conditions to check the 'status' column for specific values
conditions = [
    df2['status'] == 'TF0.0',  # Condition for 'TF0.0'
    df2['status'] == 'FT2.99',  # Condition for 'FT2.99'
    df2['status'] == 'FF2.99',  # Condition for 'FF2.99'
]

# Define the corresponding values to assign when the conditions are met
values2 = [1, 2, 3]

# Apply conditions and assign values to the new column 'status2'
# np.select applies conditions and assigns corresponding values from values2
# If none of the conditions are met, the default value np.nan is assigned
df2['status2'] = np.select(conditions, values2, default=np.nan)

# Display the first few rows of the DataFrame to verify the changes
print(df2.head())
   user_id  segment install_date transaction_date  is_free_trial  \
0        5        1   2023-09-09       2023-09-09           True   
1       10        0   2023-09-19       2023-09-19           True   
2       10        0   2023-09-19       2023-09-26          False   
3       22        1   2023-09-04       2023-09-04           True   
4       22        1   2023-09-04       2023-09-11          False   

   is_conversion_to_pay  price_in_usd  status  week  n_trans_opp  status2  
0                 False          0.00   TF0.0     2           29      1.0  
1                 False          0.00   TF0.0     3           27      1.0  
2                  True          2.99  FT2.99     4           27      2.0  
3                 False          0.00   TF0.0     1           30      1.0  
4                  True          2.99  FT2.99     2           30      2.0  
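As an aside (an alternative not used in the notebook), `Series.map` with a dictionary produces the same numeric encoding as the `np.select` call above, with unmatched statuses becoming NaN automatically:

```python
import pandas as pd

# Toy statuses; 'XX9.99' stands in for any unexpected value
status = pd.Series(['TF0.0', 'FT2.99', 'FF2.99', 'XX9.99'])

# Keys missing from the dict map to NaN, matching np.select's default=np.nan
status2 = status.map({'TF0.0': 1, 'FT2.99': 2, 'FF2.99': 3})
print(status2.tolist())  # [1.0, 2.0, 3.0, nan]
```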

Reshape DataFrame: Convert Long DataFrame to Wide Matrix#

Reshape the DataFrame for use in a heatmap chart with Plotly.

# Extract columns needed by the heatmap
df3 = df2[['status2','week','user_id','segment']]

# Separate the different segment groups
df_seg0 = df3[df3['segment'] == 0]
df_seg1 = df3[df3['segment'] == 1]
df_seg2 = df3[df3['segment'] == 2]

# Convert long dataframe to wide dataframe
df_seg0_wide = df_seg0.pivot(index="week", columns="user_id", values="status2")
# Convert the 'user_id' columns to string
df_seg0_wide.columns = df_seg0_wide.columns.astype(str)

df_seg1_wide = df_seg1.pivot(index="week", columns="user_id", values="status2")
df_seg1_wide.columns = df_seg1_wide.columns.astype(str)

df_seg2_wide = df_seg2.pivot(index="week", columns="user_id", values="status2")
df_seg2_wide.columns = df_seg2_wide.columns.astype(str)

# show examples
print(df_seg0_wide.head())
user_id   10   46   62   88  144  149  157  166  210  248  ...  149696  \
week                                                       ...           
1        NaN  NaN  NaN  NaN  1.0  NaN  NaN  NaN  NaN  NaN  ...     NaN   
2        NaN  NaN  NaN  NaN  2.0  1.0  NaN  NaN  1.0  1.0  ...     NaN   
3        1.0  NaN  NaN  1.0  NaN  2.0  1.0  1.0  NaN  NaN  ...     1.0   
4        2.0  1.0  1.0  2.0  NaN  3.0  2.0  NaN  NaN  NaN  ...     NaN   
5        NaN  2.0  2.0  NaN  3.0  NaN  NaN  NaN  NaN  NaN  ...     NaN   

user_id  149722  149742  149753  149804  149829  149916  149956  149967  \
week                                                                      
1           NaN     1.0     1.0     1.0     NaN     NaN     NaN     1.0   
2           1.0     2.0     NaN     2.0     NaN     NaN     1.0     NaN   
3           2.0     NaN     NaN     3.0     NaN     NaN     2.0     NaN   
4           NaN     3.0     NaN     NaN     1.0     1.0     NaN     NaN   
5           3.0     NaN     NaN     NaN     NaN     2.0     3.0     NaN   

user_id  149976  
week             
1           1.0  
2           2.0  
3           3.0  
4           3.0  
5           NaN  

[5 rows x 5078 columns]
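One caveat worth noting (a property of `DataFrame.pivot`, not something exercised by the case-study data): `pivot` raises a ValueError if any (week, user_id) pair appears twice, i.e. if a user had two transactions in the same week. `pivot_table` with an aggregation function can resolve such collisions, as this hypothetical sketch shows:

```python
import pandas as pd

# Hypothetical collision: user 1 has two statuses in week 3
df_dup = pd.DataFrame({
    'week': [3, 3, 4],
    'user_id': [1, 1, 1],
    'status2': [1.0, 2.0, 3.0],
})

# df_dup.pivot(index='week', columns='user_id', values='status2')
# would raise ValueError: Index contains duplicate entries.
# pivot_table aggregates duplicates instead; 'max' keeps the most
# advanced status in the 1 < 2 < 3 encoding.
wide = df_dup.pivot_table(index='week', columns='user_id',
                          values='status2', aggfunc='max')
print(wide.loc[3, 1])  # 2.0
```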

Customer Behaviour History via Heatmap with Customised Colorbar#

# Set up layout parameters
figure_width = 750
height_each = 250
colorscale = [
    [0, 'rgba(31, 119, 180, 0.8)'],
    [0.333334, 'rgba(31, 119, 180, 0.8)'],
    [0.333334, 'rgba(255, 127, 14, 0.8)'],
    [0.666667, 'rgba(255, 127, 14, 0.8)'],
    [0.666667, 'rgba(44, 160, 44, 0.8)'],
    [1, 'rgba(44, 160, 44, 0.8)']
]
# Define custom tick labels for the colorbar
colorbar_tickvals = [1.3333333, 2, 2.666667]  # the colour values run from 1 to 3
# By default the tick values would sit at 1, 2, and 3 (visible if we leave tickvals unchanged).
# To centre a label in each of the three colour bands of the colorbar, place the ticks
# at the band midpoints, which a simple calculation gives:
# 1.3333333 = 1 + (3-1)/6; 2 = 1 + (3-1)/2; 2.666667 = 3 - (3-1)/6

colorbar_ticktext = ['TF0.0', 'FT2.99', 'FF2.99']  # the text of the tick labels

# Define the chart as a function so it can be reused for segment0, segment1, and segment2
def create_heatmap(df_seg_wide, chart_title, figure_width, height_each, colorscale, colorbar_tickvals, colorbar_ticktext):
    # Define hover labels
    hover_labels = {1: 'TF0.0', 2: 'FT2.99', 3: 'FF2.99', np.nan: 'None'}
    
    # Create the heatmap
    fig = px.imshow(df_seg_wide, aspect="auto")
    
    # Update the trace with the custom hover template
    fig.update_traces(
        opacity=0.8, 
        hovertemplate='user_id: %{x}<br>week: %{y}<br>Status: %{customdata}<extra></extra>',
        customdata=df_seg_wide.replace(hover_labels).values
    )
    
    # Update layout
    fig.update_layout(
        title=f'{chart_title}',
        width=figure_width,
        height=height_each,
        margin=dict(l=2, r=2, t= 30, b=0),
        xaxis_title=None,
        coloraxis=dict(
            colorscale=colorscale,
            colorbar=dict(
                tickvals=colorbar_tickvals,
                thickness=25,
                ticktext=colorbar_ticktext
            )
        )
    )
    
    fig.update_yaxes(autorange=True, title='The n-<i>th</i> week')
    fig.update_xaxes(showticklabels=False)
    
    return fig

# Call the function
fig0 = create_heatmap(df_seg0_wide, 'segment0', figure_width, height_each, colorscale, colorbar_tickvals, colorbar_ticktext)
fig0.show()
fig1 = create_heatmap(df_seg1_wide, 'segment1', figure_width, height_each, colorscale, colorbar_tickvals, colorbar_ticktext)
fig1.show()
fig2 = create_heatmap(df_seg2_wide, 'segment2', figure_width, height_each, colorscale, colorbar_tickvals, colorbar_ticktext)
fig2.show()
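The midpoint arithmetic behind `colorbar_tickvals` generalises to any number of colour bands. This hypothetical helper (`band_midpoints` is not part of the notebook) computes the label positions for n equal bands spanning [lo, hi]:

```python
def band_midpoints(lo, hi, n):
    """Midpoints of n equal-width bands covering the interval [lo, hi]."""
    width = (hi - lo) / n
    return [lo + width * (i + 0.5) for i in range(n)]

mids = band_midpoints(1, 3, 3)
print(mids)  # approximately [1.333, 2.0, 2.667], matching the hand-computed tickvals
```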

Advantages of using a heatmap for customer transaction data:

  • It provides an at-a-glance overview of the entire transaction history.

  • It assists researchers in determining the direction for further analysis.

  • Data scientists can avoid repeatedly referring to raw data, ensuring no information is overlooked.

  • It offers clear insights into different types of customers and their approximate proportions.